Multiplexed imaging enables the simultaneous spatial profiling of dozens of biological molecules in tissues at single-cell resolution. Extracting biologically relevant information such as the spatial distribution of cell phenotypes from multiplexed tissue imaging data involves a number of computational tasks, including image segmentation, feature extraction, and spatially resolved single-cell analysis. Here, we present an end-to-end workflow for multiplexed tissue image processing and analysis, that integrates a number of previously developed computational tools to enable these tasks in a user-friendly and customizable fashion. For data quality assessment, we highlight the utility of napari-imc for interactively inspecting raw imaging data and the cytomapper R/Bioconductor package for image visualization in R. Raw data preprocessing, image segmentation and feature extraction are performed using the steinbock toolkit. We showcase two alternative approaches for segmenting cells based on supervised pixel classification and pre-trained deep learning models. The extracted single-cell data is then read, processed and analyzed in R. The protocol describes the use of community-established data containers, facilitating the application of R/Bioconductor packages for dimensionality reduction, single-cell visualization and phenotyping. We provide instructions to perform spatially resolved single-cell analysis, including community analysis, cellular neighborhood detection and cell-cell interaction testing using the imcRtools R/Bioconductor package. The workflow has been previously applied to imaging mass cytometry data, but can be easily adapted to other highly multiplexed imaging technologies. This protocol can be implemented by researchers with basic bioinformatics training, and analysis of the provided dataset can be completed within 5 to 6 hours. An extended version of the workflow can be accessed at https://bodenmillergroup.github.io/IMCDataAnalysis/.
We provide a small imaging mass cytometry (IMC) dataset as an example to demonstrate multi-channel image processing and spatially resolved single-cell data analysis. The data was generated as part of the Integrated iMMUnoprofiling of large adaptive CANcer patient cohorts (IMMUcan) project using the Hyperion imaging system (Standard BioTools, San Francisco, CA, USA). The data can be accessed at https://zenodo.org/record/7575859 and contains the following files:
We further provide a panel.csv file in steinbock format, which contains
metadata for each antibody/channel used in the experiment. In this file, the
channel column indicates the metal isotopes used to label each antibody. The
name column contains the protein targets of each antibody. The keep column
indicates which channels were extracted from raw data and thus retained during
downstream analysis. Importantly, the order of rows for which the keep column is
set to 1 matches the order of channels in the extracted multi-channel images.
Finally, the deepcell column indicates which channels were used (i.e.,
aggregated) during automatic image segmentation using DeepCell/Mesmer: nuclear
channels are set to 1, cytoplasmic channels are set to 2, and the
column is left empty for other/unused channels. The file can be accessed at
https://zenodo.org/record/7624451.
In addition, we provide SpatialExperiment objects that contain previously
gated cells. These cells form the ground truth cell phenotype labels for
classification-based phenotyping. The data can be accessed at
https://zenodo.org/record/7647079.
options(timeout = 10000)
dir.create("data/steinbock/raw", recursive = TRUE)
download.file("https://zenodo.org/record/7624451/files/panel.csv",
"data/steinbock/panel.csv")
download.file("https://zenodo.org/record/7575859/files/Patient1.zip",
"data/steinbock/raw/Patient1.zip")
download.file("https://zenodo.org/record/7575859/files/Patient2.zip",
"data/steinbock/raw/Patient2.zip")
download.file("https://zenodo.org/record/7575859/files/Patient3.zip",
"data/steinbock/raw/Patient3.zip")
download.file("https://zenodo.org/record/7575859/files/Patient4.zip",
"data/steinbock/raw/Patient4.zip")
download.file("https://zenodo.org/record/7575859/files/compensation.zip",
"data/compensation.zip")
unzip("data/compensation.zip", exdir="data", overwrite=TRUE)
unlink("data/compensation.zip")
download.file("https://zenodo.org/record/7575859/files/sample_metadata.csv",
destfile = "data/sample_metadata.csv")
download.file("https://zenodo.org/record/7647079/files/gated_cells.zip",
"data/gated_cells.zip")
unzip("data/gated_cells.zip", exdir="data", overwrite=TRUE)
unlink("data/gated_cells.zip")
Multiplexed tissue image processing typically comprises individual steps for image extraction and pre-processing, image segmentation, and single-cell data extraction. Here we highlight the deep learning-based segmentation approach using the steinbock framework.
The commands in this section have to be executed in the “Command Prompt” on Windows or the “Terminal” on Linux/Mac OS.
Instructions to install the containerized steinbock toolkit can be found in the steinbock online documentation. In particular, to run the steinbock container, Docker needs to be installed first (see online instructions). For this manuscript, we run steinbock using the following alias:
alias steinbock="docker run -v /path/to/IMCDataAnalysis/publiction/data/steinbock:/data -u $(id -u):$(id -g) ghcr.io/bodenmillergroup/steinbock:0.16.0"
A shell script for automatic execution of the commands can be seen at steinbock.sh.
In this protocol, data pre-processing refers to the extraction of multi-channel images from raw imaging data, and to preparing them for downstream processing. The required steps are dependent on the imaging technology; here, we showcase the pre-processing of raw IMC data which includes a hot pixel filtering step.
steinbock preprocess imc images --hpf 50
The step took 38 seconds.
Perform automatic deep learning-enabled single-cell segmentation using the pre-trained Mesmer neural network implemented in DeepCell. In the following command, channels will be min-max-normalized and mean-aggregated according to the deepcell column in the panel file.
steinbock segment deepcell --minmax
The step took 3 minutes.
For each image, extract the mean pixel intensity per cell and marker. The resulting cell-level intensity values are stored as separate CSV files (one file per image):
steinbock measure intensities
The step took 16 seconds.
For each image, extract standard morphological features (e.g., area, eccentricity) per cell. The resulting cell-level features are stored as separate CSV files (one file per image):
steinbock measure regionprops
The step took 17 seconds.
In each image, detect cells in close spatial proximity. Shown here is the construction of undirected spatial cell graphs by pixel expansion, where cells are considered neighboring if and only if they are “touching” after growing each cell by maximally 4 pixels in all directions. The resulting spatial cell graphs are stored as separate directed edge lists in CSV format (one file per image):
steinbock measure neighbors --type expansion --dmax 4
The step took 48 seconds.
The commands in this section must be executed in R and the RStudio software can facilitate execution.
Read in the spatially-resolved single-cell data into R using the imcRtools package. For the rest of the protocol we will continue with the steinbock generated data.
library(imcRtools)
spe <- read_steinbock("data/steinbock/")
The step took 0.46 minutes.
After reading in the single-cell data, the SpatialExperiment object needs to
be further processed. First, the column names are set based on the image name
and the cell identifier. The patient identifier and the region of interest (ROI)
identifier are saved in the object as well as the cancer type, which can be read
in from the provided data/sample_metadata.csv file. For easy access later on,
the channels containing biological variation are selected. Finally, the mean
pixel intensities per channel and cell are arsinh-transformed using a cofactor
of 1.
library(tidyverse)
colnames(spe) <- paste0(spe$sample_id, "_", spe$ObjectNumber)
# Read patient metadata
meta <- read_csv("data/sample_metadata.csv")
# Extract patient id and ROI id from sample name
spe$patient_id <- str_extract(spe$sample_id, "Patient[1-4]")
spe$ROI <- str_extract(spe$sample_id, "00[1-8]")
# Store cancer type in SPE object
spe$indication <- meta$Indication[match(spe$patient_id, meta$`Sample ID`)]
# Specify which channels to use for downstream analysis
rowData(spe)$use_channel <- !grepl("DNA|Histone", rownames(spe))
# Transform the counts
assay(spe, "exprs") <- asinh(counts(spe)/1)
The step took 0.02 minutes.
Read in multi-channel images as a CytoImageList container using the
cytomapper package.
library(cytomapper)
images <- loadImages("data/steinbock/img/")
## All files in the provided location will be read in.
channelNames(images) <- rownames(spe)
The step took 0.83 minutes.
Read in segmentation masks as a CytoImageList container.
masks <- loadImages("data/steinbock/masks/", as.is = TRUE)
## All files in the provided location will be read in.
The step took 0.01 minutes.
For downstream visualization and analysis tasks, additional metadata needs to be
added to the CytoImageList objects storing the multi-channel images and
segmentation masks. Here, individual images, segmentation masks and entries in
the SpatialExperiment object are matched via the sample_id entry.
# Extract patient id from image name
patient_id <- str_extract(names(images), "Patient[1-4]")
# Retrieve cancer type per patient from metadata file
indication <- meta$Indication[match(patient_id, meta$`Sample ID`)]
# Store patient and image level information in elementMetadata
mcols(images) <- mcols(masks) <- DataFrame(sample_id = names(images),
patient_id = patient_id,
indication = indication)
The step took 0 minutes.
Low signal spillover (also termed crosstalk) between two channels has been observed for IMC, MIBI, and multiplexed immunofluorescence imaging such as CyCIF. Spillover is defined as a small proportion of the signal of a neighboring channel that can be detected in the primary channel. As spillover is linear to the signal of the neighboring channel, it can be corrected by following a previously described compensation approach. The spillover correction presented here is characteristic to IMC, and the steps of the following section can be skipped when working with data generated by other multiplexed imaging technologies assuming an alternative way of spillover minimization/correction was performed.
Read in data from the spillover slide for channel-to-channel spillover correction. The experimental procedure to create and acquire a spillover slide can be seen at https://bodenmillergroup.github.io/IMCDataAnalysis/spillover-correction.html. As recommended by the CATALYST R/Bioconductor package, the pixel intensities are arsinh-transformed using a cofactor of 5.
sce <- readSCEfromTXT("data/compensation/")
## Spotted channels: Y89, In113, In115, Pr141, Nd142, Nd143, Nd144, Nd145, Nd146, Sm147, Nd148, Sm149, Nd150, Eu151, Sm152, Eu153, Sm154, Gd155, Gd156, Gd158, Tb159, Gd160, Dy161, Dy162, Dy163, Dy164, Ho165, Er166, Er167, Er168, Tm169, Er170, Yb171, Yb172, Yb173, Yb174, Lu175, Yb176
## Acquired channels: Ar80, Y89, In113, In115, Xe131, Xe134, Ba136, La138, Pr141, Nd142, Nd143, Nd144, Nd145, Nd146, Sm147, Nd148, Sm149, Nd150, Eu151, Sm152, Eu153, Sm154, Gd155, Gd156, Gd158, Tb159, Gd160, Dy161, Dy162, Dy163, Dy164, Ho165, Er166, Er167, Er168, Tm169, Er170, Yb171, Yb172, Yb173, Yb174, Lu175, Yb176, Ir191, Ir193, Pt196, Pb206
## Channels spotted but not acquired:
## Channels acquired but not spotted: Ar80, Xe131, Xe134, Ba136, La138, Ir191, Ir193, Pt196, Pb206
assay(sce, "exprs") <- asinh(counts(sce)/5)
The step took 0.11 minutes.
CRITICAL: The provided data of the spillover slide were specifically acquired for this dataset and cannot be applied to other datasets. It is essential to acquire a spillover slide for each set of antibodies used and ideally for each experiment.
Perform quality assessment of the spillover data by visualizing the median pixel intensity per channel and spotted metal.
plotSpotHeatmap(sce)
(optional) Perform pixel binning to increase median pixel intensity. This is only needed if pixel intensities are too low (median below ~200 counts).
sce2 <- binAcrossPixels(sce, bin_size = 10)
The step took 0.25 minutes.
Filter incorrectly assigned pixels. The following step uses functions provided
by the CATALYST package to “de-barcode” the pixels. Based on the intensity
distribution of all channels, pixels are assigned to their corresponding
barcode; here, this is the already known metal spot. This procedure identifies
pixels that cannot be robustly assigned to the spotted metal. Pixels of such
kind can be regarded as “noisy”, “background”, or “artifacts” that should be
removed prior to spillover estimation. The spotted channels (bc_key) need to
be specified. The general workflow for pixel de-barcoding is as follows:
library(CATALYST)
bc_key <- as.numeric(unique(sce$sample_mass))
bc_key <- bc_key[order(bc_key)]
sce <- assignPrelim(sce, bc_key = bc_key)
## Debarcoding data...
## o ordering
## o classifying events
## Normalizing...
## Computing deltas...
sce <- estCutoffs(sce)
sce <- applyCutoffs(sce)
The step took 0.17 minutes.
Observe the number of correctly and incorrectly assigned pixels per spot.
The heatmap depicts the number of pixels assigned to each spot on the
diagonal, incorrectly assigned pixels off diagonal, and pixels that could not be
assigned to a specific spot in the first row. For the current dataset, all
pixels were either assigned to the corresponding spot or labeled as unassigned.
In the following code chunk, we observe that between 10% and 55% of pixels could
not be assigned to their corresponding spot which is to be expected as the
spatial distribution of spotted antibodies on the spillover slide can be
heterogeneous. In the filterPixels function the minevents parameter specifies
the threshold under which correctly assigned pixel sets are excluded from
spillover estimation. The correct_pixels parameter indicates if pixels that were
assigned to masses other than the spotted mass should be excluded from spillover
estimation. The default values often result in sufficient pixel filtering;
however, if very few pixels (~100) are measured per spot, the minevents
parameter needs to be set to a lower threshold.
library(pheatmap)
cur_table <- table(sce$bc_id, sce$sample_mass)
# Visualize the correctly and incorrectly assigned pixels
pheatmap(log10(cur_table + 1),
cluster_rows = FALSE,
cluster_cols = FALSE)
# Compute the fraction of unassigned pixels per spot
cur_table["0",] / colSums(cur_table)
## 113 115 141 142 143 144 145 146 147 148 149
## 0.1985 0.1060 0.2575 0.3195 0.3190 0.3825 0.3545 0.4280 0.3570 0.4770 0.4200
## 150 151 152 153 154 155 156 158 159 160 161
## 0.4120 0.4025 0.4050 0.4630 0.4190 0.4610 0.3525 0.4020 0.4655 0.4250 0.5595
## 162 163 164 165 166 167 168 169 170 171 172
## 0.4340 0.4230 0.4390 0.4055 0.5210 0.3900 0.3285 0.3680 0.5015 0.4900 0.5650
## 173 174 175 176 89
## 0.3125 0.4605 0.4710 0.2845 0.3015
# Filter pixels that were assigned to a mass other than the spotted mass
sce <- filterPixels(sce, minevents = 40, correct_pixels = TRUE)
Compute and store the spillover matrix using the CATALYST package.
sce <- computeSpillmat(sce)
sm <- metadata(sce)$spillover_matrix
The step took 0.06 minutes.
Perform single-cell data compensation using the
CATALYST package. The compCytof
function corrects channel-to-channel spillover directly on the single-cell
intensities using the previously estimated spillover matrix. The isotope_list
variable needs to be extended by isotopes that are not contained in this list
provided by the CATALYST package. Visualization of marker intensities of
neighboring channels (e.g., Yb173 and Yb174) before and after correction can be
used to assess the spillover correction efficacy.
library(dittoSeq)
library(patchwork)
# Specify the channel_name entry for use with CATALYST
rowData(spe)$channel_name <- paste0(rowData(spe)$channel, "Di")
# Add the missing metal isotope to the isotope_list variable
isotope_list <- CATALYST::isotope_list
isotope_list$Ar <- 80
spe <- compCytof(spe, sm,
transform = TRUE, cofactor = 1,
isotope_list = isotope_list,
overwrite = FALSE)
# Visualize biaxial plots before/after compensation
before <- dittoScatterPlot(spe, x.var = "Ecad", y.var = "CD303",
assay.x = "exprs", assay.y = "exprs") +
ggtitle("Before compensation")
after <- dittoScatterPlot(spe, x.var = "Ecad", y.var = "CD303",
assay.x = "compexprs", assay.y = "compexprs") +
ggtitle("After compensation")
before + after
# Replace uncompensated assays
assay(spe, "counts") <- assay(spe, "compcounts")
assay(spe, "exprs") <- assay(spe, "compexprs")
assay(spe, "compcounts") <- assay(spe, "compexprs") <- NULL
The step took 0.22 minutes.
Perform channel-to-channel spillover correction on multi-channel images. To this end, the previously computed spillover matrix needs to be adjusted to only retain channels that are stored in the multi-channel images. By visualizing neighboring channels, spillover correction efficacy can be assessed.
# Use mass tags as channel names
channelNames(images) <- rowData(spe)$channel_name
# Adapt spillover matrix to retain channels contained in the multi-channel images
adapted_sm <- adaptSpillmat(sm, channelNames(images),
isotope_list = isotope_list)
## Compensation is likely to be inaccurate.
## Spill values for the following interactions
## have not been estimated:
## Ir191Di -> Ir193Di
## Ir193Di -> Ir191Di
# Perform image compensation
images_comp <- compImage(images, adapted_sm)
# Visualize neighboring channels before/ after compensation
plotPixels(images[5], colour_by = "Yb173Di",
image_title = list(text = "Yb173 (Ecad) - before",
position = "topleft"),
legend = NULL, bcg = list(Yb173Di = c(0, 4, 1)))
plotPixels(images[5], colour_by = "Yb174Di",
image_title = list(text = "Yb174 (CD303) - before",
position = "topleft"),
legend = NULL, bcg = list(Yb174Di = c(0, 4, 1)))
plotPixels(images_comp[5], colour_by = "Yb173Di",
image_title = list(text = "Yb173 (Ecad) - after",
position = "topleft"),
legend = NULL, bcg = list(Yb173Di = c(0, 4, 1)))
plotPixels(images_comp[5], colour_by = "Yb174Di",
image_title = list(text = "Yb174 (CD303) - after",
position = "topleft"),
legend = NULL, bcg = list(Yb174Di = c(0, 4, 1)))
# Switch back to using target names as channel names
channelNames(images_comp) <- rownames(spe)
The step took 11.94 minutes.
Outline cells on composite images for visual assessment of segmentation quality. For visualization purposes, we subset 3 images and outline all cells on composite images after channel normalization.
set.seed(20220118)
img_ids <- sample(seq_len(length(images_comp)), 3)
cur_images <- images_comp[img_ids]
# Normalize each channel between 0 and 1
cur_images <- cytomapper::normalize(cur_images, separateImages = TRUE)
# Clip channel intensities at 0 and 0.2
cur_images <- cytomapper::normalize(cur_images, inputRange = c(0, 0.2))
plotPixels(cur_images,
mask = masks[img_ids],
img_id = "sample_id",
missing_colour = "white",
colour_by = c("CD163", "CD20", "CD3", "Ecad", "DNA1"),
colour = list(CD163 = c("black", "yellow"),
CD20 = c("black", "red"),
CD3 = c("black", "green"),
Ecad = c("black", "cyan"),
DNA1 = c("black", "blue")),
image_title = NULL,
legend = list(colour_by.title.cex = 0.9,
colour_by.labels.cex = 0.9))
CRITICAL: This visualization approach is useful to identify missegmented cells
in images. It is recommended to visualize many images and zoom in to assess
segmentation quality. An image is sufficiently segmented, if individual nuclei
are outlined by their respective cell mask, without obvious clumping or
splitting of cells. Provided the measured markers are localized within the
nucleus or relatively homogeneously distributed within the cells, only the
nuclei of cells can be segmented to reduce lateral spillover from neighboring
cells. This can optionally be combined with pixel expansion using the steinbock utils expand command (see steinbock online documentation) to slightly enlarge
the area of cells captured by the cell masks.
Visualize the distribution of the cell area and filter out small cells.
dittoPlot(spe, var = "area",
group.by = "sample_id",
plots = "boxplot") +
ylab("Cell area") + xlab("")
spe <- spe[,spe$area >= 5]
CRITICAL: Multiplexed imaging of tissue sections typically does not capture whole cells, but measures slices of cells. Small cell slices, which may not accurately represent the corresponding cell, should therefore be excluded from the analysis. The selected threshold is dataset-specific and should be fine-tuned by visualizing the distribution of the cell area.
Visualize the image area covered by cells per image.
# Compute the fraction of image area covered by cells
cell_density <- colData(spe) %>%
as.data.frame() %>%
group_by(sample_id) %>%
# Compute the number of pixels covered by cells and the total number of pixels
summarize(cell_area = sum(area),
no_pixels = mean(width_px) * mean(height_px)) %>%
# Divide the total number of pixels by the number of pixels covered by cells
mutate(covered_area = cell_area / no_pixels)
# Visualize the image area covered by cells per image
ggplot(cell_density) +
geom_point(aes(sample_id, covered_area)) +
theme_minimal(base_size = 15) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 15)) +
ylim(c(0, 1)) +
ylab("% covered area") + xlab("")
CRITICAL: When selecting dense tissue regions for acquisition, we expect a high fraction of image pixels to be covered by cells. However, due to the presence of sparse tissue structures (e.g., vessels) or because of tissue loss, the covered fraction can be lower.
Visualize staining differences between samples for selected markers. Together with the visualization of cells in low dimensions, this figure indicates sample-to-sample differences in marker expression.
multi_dittoPlot(spe, vars = c("HLADR", "CD3", "Ecad", "PDGFRb"),
group.by = "patient_id", plots = c("ridgeplot"),
assay = "exprs")
## Picking joint bandwidth of 0.162
## Picking joint bandwidth of 0.0982
## Picking joint bandwidth of 0.151
## Picking joint bandwidth of 0.089
Visualize low-dimensional embeddings of single cells. Here, we use the scater package to compute a Uniform Manifold Approximation and Projection (UMAP) embedding and visualize cells in low-dimensional space.
library(scater)
set.seed(220225)
spe <- runUMAP(spe, subset_row = rowData(spe)$use_channel,
exprs_values = "exprs")
# Visualize the UMAP embedding and color cells by patient ID
dittoDimPlot(spe, var = "patient_id",
reduction.use = "UMAP", size = 0.2) +
ggtitle("Patient ID on UMAP")
The step took 0.68 minutes.
CRITICAL: Differences in marker distributions or non-overlapping samples on the UMAP visualization can indicate sample-to-sample differences in staining or biological differences. Staining differences can arise when samples were processed differently or stained with different batches of an antibody cocktail.
Perform batch correction to remove sample-to-sample differences.
Here, we use the fastMNN method of the
batchelor
package.
library(batchelor)
set.seed(220228)
# Perform batch correction
out <- fastMNN(spe, batch = spe$patient_id,
auto.merge = TRUE,
subset.row = rowData(spe)$use_channel,
assay.type = "exprs")
## Warning in check_numbers(k = k, nu = nu, nv = nv, limit = min(dim(x)) - : more
## singular values/vectors requested than available
## Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth =
## TRUE, : You're computing too large a percentage of total singular values, use a
## standard svd instead.
# Store corrected embeddings in SPE object
reducedDim(spe, "fastMNN") <- reducedDim(out, "corrected")
# Compute UMAP on corrected embeddings
spe <- runUMAP(spe, dimred= "fastMNN", name = "UMAP_mnnCorrected")
# Visualize corrected UMAP
dittoDimPlot(spe, var = "patient_id",
reduction.use = "UMAP_mnnCorrected", size = 0.2) +
ggtitle("Patient ID on UMAP after correction")
CRITICAL: The accuracy of batch correction needs to be visually assessed to
estimate whether biological variation - in addition to technical variation - was
removed. This can be done by coloring the UMAP embedding based on markers that
are known to be expressed in certain cell phenotypes. These markers should be
expressed in cells that cluster in the UMAP embedding. Other batch effect
correction approaches such as Seurat and harmony can be used to improve data
integration.
The step took 3.4 minutes.
Define cellular phenotypes. For this, single cells can be clustered (A) or cells can be labelled via classification (B).
Graph-based clustering is performed using functions from the
bluster and
scran
R/Bioconductor packages. Alternatively, other approaches such as phenograph or
FlowSOM can be used to cluster single cells (see
https://bodenmillergroup.github.io/IMCDataAnalysis/cell-phenotyping.html#clustering).
Estimate optimal clustering parameters for graph-based clustering on the
integrated cells after batch correction. We perform a sweep across possible
combinations of clustering parameters, including the number of nearest neighbors
to consider (k) and the edge weighting method (type). We keep the community
detection algorithm (cluster.fun) fixed, as the Louvain method is one of the
most commonly used algorithms for graph-based clustering. To assess cluster
stability, we compute the mean silhouette width across all cells and select the
cluster parameter combination with highest mean silhouette width.
library(bluster)
library(BiocParallel)
# Select the corrected cell embeddings for clustering
mat <- reducedDim(spe, "fastMNN")
# Perform the cluster sweep
combinations <- clusterSweep(mat, BLUSPARAM=SNNGraphParam(),
k=c(10L, 20L),
type = c("rank", "jaccard"),
cluster.fun = "louvain",
BPPARAM = SerialParam(RNGseed = 230214))
# Compute the average silhouette width per parameter combination
sil <- vapply(as.list(combinations$clusters),
function(x) mean(approxSilhouette(mat, x)$width), 0)
# Visualize the average silhouette width per parameter combination
ggplot(data.frame(method = names(sil),
sil = sil)) +
geom_point(aes(method, sil), size = 3) +
theme_classic(base_size = 15) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("Cluster parameter combination") +
ylab("Average silhouette width")
The step took 4.92 minutes.
CRITICAL: For each dataset, parameter estimation should be performed independently. For large datasets, the function takes a long time to run, and - as an approximation - the dataset can be subsampled to speed up the cluster sweep.
Based on the selected parameters, cells are clustered using a graph-based
algorithm. As observed above, the parameter setting for k=20 and type=”rank”
should result in well-separated clusters. The cluster identifiers are then saved
in the SpatialExperiment object.
library(scran)
clusters <- clusterCells(spe,
use.dimred = "fastMNN",
BLUSPARAM = SNNGraphParam(k = 20,
cluster.fun = "louvain",
type = "rank"))
spe$nn_clusters <- clusters
The step took 1.26 minutes.
To annotate the individual clusters based on cell phenotypes contained within, the marker expression per cluster can be visualized in the form of a heatmap. For visualization purposes, 2000 cells are randomly selected from the dataset.
library(viridis)
set.seed(220619)
cur_cells <- sample(seq_len(ncol(spe)), 2000)
dittoHeatmap(spe[,cur_cells],
genes = rownames(spe)[rowData(spe)$use_channel],
assay = "exprs", scale = "none",
heatmap.colors = viridis(100),
annot.by = c("nn_clusters", "patient_id"))
After observing cluster-specific expression patterns, the individual clusters can be annotated. For example, cluster 11 shows high expression of the B cell marker CD20 and can therefore be labeled as “Bcell”.
cluster_celltype <- recode(spe$nn_clusters,
"1" = "Tumor_proliferating",
"2" = "Myeloid",
"3" = "Tumor",
"4" = "Tumor",
"5" = "Stroma",
"6" = "Proliferating",
"7" = "Myeloid",
"8" = "CD8",
"9" = "Plasma_cell",
"10" = "CD4",
"11" = "Bcell",
"12" = "Stroma",
"13" = "pDC")
spe$cluster_celltype <- cluster_celltype
Gate individual cell types based on their marker expression. For this, the
cytomapper package provides the cytomapperShiny function. Per image, cells
are gated based on their marker expression in a hierarchical fashion to define
expected cell types. The gated cells are then visualized as outlines on
pseudo-coloured composite images. Once the correct cells are labeled, they can
be downloaded as a SpatialExperiment object storing only the selected cells.
During download, the cell label can be specified, which is stored in the
cytomapper_CellLabel entry of the colData slot for later use in training a
classifier.
if (interactive()){
cytomapperShiny(object = spe, mask = masks, image = images_comp,
cell_id = "ObjectNumber", img_id = "sample_id")
}
CRITICAL: Due to imperfect segmentation, lateral spillover leads to incorrect expression levels for some markers. We therefore recommend gating cells that only express markers of their specific cell phenotype.
Read in files containing the gated cells and concatenate them into a single
SpatialExperiment object.
library(SingleCellExperiment)
label_files <- list.files("data/gated_cells",
full.names = TRUE, pattern = ".rds$")
# Read in individual files containing gated cells
spes <- lapply(label_files, readRDS)
# Concatenate the individual objects into one
concat_spe <- do.call("cbind", spes)
Remove cells that were labeled multiple times and reassign tumor cells. As a
result of the multi-step labeling approach, some cells may have been labeled
several times. In cases where one cell was labeled as both tumor and immune
cell, we keep the immune cell label, as these cells are most likely immune cells
residing within the tumor. All other cells that were labeled several times are
removed. Finally, the labels are stored in the main SpatialExperiment object.
We first define a helper function that returns cell labels for cells that were labeled only once.
filter_labels <- function(object,
label = "cytomapper_CellLabel") {
cur_tab <- unclass(table(colnames(object), object[[label]]))
cur_labels <- colnames(cur_tab)[apply(cur_tab, 1, which.max)]
names(cur_labels) <- rownames(cur_tab)
cur_labels <- cur_labels[rowSums(cur_tab) == 1]
return(cur_labels)
}
Next, this function is applied to retrieve all cells that were labeled once.
labels <- filter_labels(concat_spe)
The function is applied again to all cells while excluding the “Tumor” label column. This approach identifies those cells that were labeled once as tumor cells and once as immune cells.
cur_spe <- concat_spe[,concat_spe$cytomapper_CellLabel != "Tumor"]
non_tumor_labels <- filter_labels(cur_spe)
additional_cells <- setdiff(names(non_tumor_labels), names(labels))
Finally, the two sets of cell labels are merged and stored in the
SpatialExperiment object.
final_labels <- c(labels, non_tumor_labels[additional_cells])
# Transfer labels to SPE object
spe_labels <- rep("unlabeled", ncol(spe))
names(spe_labels) <- colnames(spe)
spe_labels[names(final_labels)] <- final_labels
spe$cell_labels <- spe_labels
Train a random forest classifier for cell type classification of unlabelled cells. The cells are first split into labelled and unlabelled cells. We then perform a 75/25 split of the labelled cells to select training and testing datasets, respectively. Based on the training dataset, we perform a 5-fold cross validation to tune the random forest model parameter.
library(caret)
# Split between labeled and unlabeled cells
lab_spe <- spe[,spe$cell_labels != "unlabeled"]
unlab_spe <- spe[,spe$cell_labels == "unlabeled"]
# Randomly split into train and test data
set.seed(221029)
trainIndex <- createDataPartition(factor(lab_spe$cell_labels), p = 0.75)
train_spe <- lab_spe[,trainIndex$Resample1]
test_spe <- lab_spe[,-trainIndex$Resample1]
# Define fit parameters for 5-fold cross validation
fitControl <- trainControl(method = "cv",
number = 5)
# Select the arsinh-transformed counts for training
cur_mat <- t(assay(train_spe, "exprs")[rowData(train_spe)$use_channel,])
# Train a random forest classifier
rffit <- train(x = cur_mat,
y = factor(train_spe$cell_labels),
method = "rf", ntree = 1000,
tuneLength = 5,
trControl = fitControl)
The step took 11 minutes.
Assess the classifier performance by computing the confusion matrix of the test
dataset. The confusionMatrix function compares the predicted cell labels to
the ground truth cell labels and computes a number of performance metrics. A
high sensitivity and a high specificity for each cell type label is to be
desired.
# Select the arsinh-transformed counts of the test data
cur_mat <- t(assay(test_spe, "exprs")[rowData(test_spe)$use_channel,])
# Predict the cell phenotype labels of the test data
cur_pred <- predict(rffit, newdata = cur_mat)
# Compute the confusion matrix between predicted and actual labels
confusionMatrix(data = cur_pred,
reference = factor(test_spe$cell_labels),
mode = "everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bcell BnTcell CD4 CD8 Myeloid Neutrophil Plasma_cell Stroma
## Bcell 186 2 0 0 0 0 6 0
## BnTcell 4 423 1 0 0 0 0 0
## CD4 0 0 163 0 0 2 3 2
## CD8 0 0 0 199 0 0 8 0
## Myeloid 0 0 2 1 437 0 0 0
## Neutrophil 0 0 0 0 0 30 0 0
## Plasma_cell 1 0 3 2 0 0 158 0
## Stroma 0 0 2 0 0 0 0 108
## Treg 0 0 0 0 0 0 3 0
## Tumor 4 0 1 3 0 1 1 0
## Reference
## Prediction Treg Tumor
## Bcell 0 1
## BnTcell 0 1
## CD4 0 5
## CD8 0 3
## Myeloid 0 0
## Neutrophil 0 0
## Plasma_cell 1 0
## Stroma 0 0
## Treg 89 2
## Tumor 0 1487
##
## Overall Statistics
##
## Accuracy : 0.9806
## 95% CI : (0.9753, 0.985)
## No Information Rate : 0.4481
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9741
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Bcell Class: BnTcell Class: CD4 Class: CD8
## Sensitivity 0.95385 0.9953 0.94767 0.97073
## Specificity 0.99714 0.9979 0.99622 0.99650
## Pos Pred Value 0.95385 0.9860 0.93143 0.94762
## Neg Pred Value 0.99714 0.9993 0.99716 0.99809
## Precision 0.95385 0.9860 0.93143 0.94762
## Recall 0.95385 0.9953 0.94767 0.97073
## F1 0.95385 0.9906 0.93948 0.95904
## Prevalence 0.05830 0.1271 0.05142 0.06129
## Detection Rate 0.05561 0.1265 0.04873 0.05949
## Detection Prevalence 0.05830 0.1283 0.05232 0.06278
## Balanced Accuracy 0.97549 0.9966 0.97195 0.98361
## Class: Myeloid Class: Neutrophil Class: Plasma_cell
## Sensitivity 1.0000 0.909091 0.88268
## Specificity 0.9990 1.000000 0.99779
## Pos Pred Value 0.9932 1.000000 0.95758
## Neg Pred Value 1.0000 0.999095 0.99340
## Precision 0.9932 1.000000 0.95758
## Recall 1.0000 0.909091 0.88268
## F1 0.9966 0.952381 0.91860
## Prevalence 0.1306 0.009865 0.05351
## Detection Rate 0.1306 0.008969 0.04723
## Detection Prevalence 0.1315 0.008969 0.04933
## Balanced Accuracy 0.9995 0.954545 0.94024
## Class: Stroma Class: Treg Class: Tumor
## Sensitivity 0.98182 0.98889 0.9920
## Specificity 0.99938 0.99846 0.9946
## Pos Pred Value 0.98182 0.94681 0.9933
## Neg Pred Value 0.99938 0.99969 0.9935
## Precision 0.98182 0.94681 0.9933
## Recall 0.98182 0.98889 0.9920
## F1 0.98182 0.96739 0.9927
## Prevalence 0.03288 0.02691 0.4481
## Detection Rate 0.03229 0.02661 0.4445
## Detection Prevalence 0.03288 0.02810 0.4475
## Balanced Accuracy 0.99060 0.99368 0.9933
Predict cell labels of unlabelled cells. Cells for which the highest class probability is below 40% are labeled as “unknown”.
# Select the arsinh-transformed counts of the unlabeled data for prediction
cur_mat <- t(assay(unlab_spe, "exprs")[rowData(unlab_spe)$use_channel,])
# Predict the cell phenotype labels of the unlabeled data
cell_class <- as.character(predict(rffit,
newdata = cur_mat,
type = "raw"))
names(cell_class) <- rownames(cur_mat)
# Extract prediction probabilities for each cell
cell_prob <- predict(rffit,
newdata = cur_mat,
type = "prob")
# Label cells with a low prediction probability as undefined
cell_class[rowMax(as.matrix(cell_prob)) < 0.4] <- "undefined"
# Store cell phenotype labels in SPE object
cell_labels <- spe$cell_labels
cell_labels[colnames(unlab_spe)] <- cell_class
spe$celltype <- cell_labels
CRITICAL: The threshold for labeling cells as undefined needs to be adjusted per dataset. When predicting more cell classes, the threshold will likely be lower while the threshold can be increased when predicting fewer classes.
Visualize cell phenotype and annotated cluster labels on the UMAP embedding to qualitatively assess the cell phenotyping.
p1 <- dittoDimPlot(spe, var = "celltype",
reduction.use = "UMAP_mnnCorrected", size = 0.2,
do.label = TRUE) +
ggtitle("Cell types on UMAP, integrated cells")
p2 <- dittoDimPlot(spe, var = "cluster_celltype",
reduction.use = "UMAP_mnnCorrected", size = 0.2,
do.label = TRUE) +
ggtitle("Cluster cell types on UMAP, integrated cells")
p1 + p2
Visualize mean marker expression per cell type and per annotated cluster as heatmaps.
library(scuttle)
# Calculate the mean of the arsinh-transformed counts per cell phenotype
celltype_mean <- aggregateAcrossCells(as(spe, "SingleCellExperiment"),
ids = spe$celltype,
statistics = "mean",
use.assay.type = "exprs",
subset_row = rowData(spe)$use_channel)
dittoHeatmap(celltype_mean,
assay = "exprs", cluster_cols = TRUE,
scale = "none",
heatmap.colors = viridis(100),
annot.by = c("celltype", "ncells"))
# Calculate the mean of the arsinh-transformed counts per annotated cluster
cluster_mean <- aggregateAcrossCells(as(spe, "SingleCellExperiment"),
ids = spe$cluster_celltype,
statistics = "mean",
use.assay.type = "exprs",
subset_row = rowData(spe)$use_channel)
dittoHeatmap(cluster_mean,
assay = "exprs", cluster_cols = TRUE,
scale = "none",
heatmap.colors = viridis(100),
annot.by = c("cluster_celltype", "ncells"))
CRITICAL: We expect individual cell phenotypes to show the highest mean expression of their specific markers. In addition cell phenotypes should be visually separated on low-dimensional embeddings such as UMAP.
Perform spatial community analysis as proposed by Jackson et al. using the
detectCommunity function provided by the imcRtools package. This method groups
cells solely based on their location in the tissue by using a previously
constructed spatial cell graph. We perform community detection separately for
tumor and non-tumor cells.
# Define if cells are part of the tumor or stroma
spe$tumor_stroma <- ifelse(spe$celltype == "Tumor", "Tumor", "Stroma")
# Detect spatial communities
spe <- detectCommunity(spe,
colPairName = "neighborhood",
size_threshold = 10,
group_by = "tumor_stroma",
BPPARAM = SerialParam(RNGseed = 220819))
# Visualize spatial tumor communities
plotSpatial(spe[,spe$celltype == "Tumor"],
node_color_by = "spatial_community",
img_id = "sample_id",
node_size_fix = 0.5) +
theme(legend.position = "none") +
scale_color_manual(values = rev(colors()))
The step took 0.13 minutes.
After detecting spatial communities, a number of downstream analyses can be performed. These include (i) computing the cell phenotype fraction per community for meta clustering, (ii) identifying shared communities between samples for comparative analysis, and (iii) computing the size of the individual communities to estimate tissue segregation.
Perform cellular neighborhood (CN) analysis. CNs are tissue regions with
characteristic cell type composition and represent sites of unique local
biological processes and interactions. We first construct a spatial 20-nearest
neighbor graph based on the cells’ centroids, to include a larger number of
neighbors compared to the spatial cell graph constructed earlier using
steinbock. The aggregateNeighbors function of the imcRtools package then
computes for each cell the cell type fraction among its 20-nearest neighbors.
Cells are subsequently clustered using k-means clustering to detect CNs.
Finally, cells can be spatially visualized and colored based on their CN, and
the cell type fraction per CN can be computed. Here, CN 3 represents a spatial
aggregation of B cells while CN 1 contains mostly tumor tissue.
# Construct a 20-nearest neighbor graph
spe <- buildSpatialGraph(spe, img_id = "sample_id", type = "knn", k = 20)
## The returned object is ordered by the 'sample_id' entry.
# Compute the fraction of cell phenotypes in each cells neighborhood
spe <- aggregateNeighbors(spe, colPairName = "knn_interaction_graph",
aggregate_by = "metadata", count_by = "celltype")
set.seed(220705)
# Cluster cells based on cell phenotype fractions
cn_1 <- kmeans(spe$aggregatedNeighbors, centers = 6)
spe$cn_celltypes <- as.factor(cn_1$cluster)
# Visualize cellular neighborhoods
plotSpatial(spe,
node_color_by = "cn_celltypes",
img_id = "sample_id",
node_size_fix = 0.5) +
scale_color_brewer(palette = "Set3")
# Compute the fraction of cell phenotypes in each cellular neighborhood
mat <- prop.table(table(spe$cn_celltypes, spe$celltype),
margin = 1)
# Visualize cell phenotype fractions in each cellular neighborhood
pheatmap(mat,
color = colorRampPalette(c("dark blue", "white", "dark red"))(100),
scale = "column")
The step took 0.28 minutes.
CRITICAL: A parameter sweep should be performed to estimate the optimal value
for k. However, with prior knowledge on the expected tissue structures to
detect, k can be set accordingly. Alternatively, cells can also be clustered
based on the mean marker expression of cells within their neighborhood. In
addition, the
lisaClust
R/Bioconductor package provides alternative strategies to detect CNs.
Perform spatial context (SC) analysis as proposed by Bhate et al. SCs build upon
the concept of CNs and are regions in which the local biological processes of
CNs interact, and where specialized biological events take place. We construct a
second k-nearest neighbor graph with larger k (k=40) to include cells across a
length scale on which biological signals could be exchanged. For each cell, the
aggregateNeighbors function computes the fraction of CNs among its 40 nearest
neighbors. The detectSpatialContext function sorts the CN fractions from high
to low, and the SC of each cell is assigned as the minimal combination of CNs
that additively surpass a user-defined threshold (here: threshold=0.9). After
filtering the detected SCs, we can spatially visualize them and represent SC
interactions as a layered graph. We observe that SC 1 (Tumor-dominated), SC 1_6
(Tumor and Tumor-Stroma interface) and SC 4_5 (Plasma/T cell and
Myeloid/Neutrophil interface) are the most frequent SCs in this dataset.
# Construct a 40-nearest neighbor graph
spe <- buildSpatialGraph(spe, img_id = "sample_id",
type = "knn",
name = "knn_spatialcontext_graph",
k = 40)
## The returned object is ordered by the 'sample_id' entry.
# Compute the fraction of cellular neighborhoods around each cell
spe <- aggregateNeighbors(spe,
colPairName = "knn_spatialcontext_graph",
aggregate_by = "metadata",
count_by = "cn_celltypes",
name = "aggregatedNeighborhood")
# Detect spatial contexts
spe <- detectSpatialContext(spe,
entry = "aggregatedNeighborhood",
threshold = 0.90,
name = "spatial_context")
# Filter out small and infrequent spatial contexts
spe <- filterSpatialContext(spe,
entry = "spatial_context",
group_by = "patient_id",
group_threshold = 3,
cells_threshold = 100,
name = "spatial_context_filtered")
# Visualize cell location colored by their spatial context
plotSpatial(spe,
node_color_by = "spatial_context_filtered",
img_id = "sample_id",
node_size_fix = 0.5)
# Visualize spatial contexts as graph
plotSpatialContext(spe,
entry = "spatial_context_filtered",
group_by = "sample_id",
node_color_by = "n_cells",
node_size_by = "n_group",
node_label_color_by = "n_cells") +
scale_color_viridis()
The step took 0.51 minutes.
Perform patch detection analysis. The patchDetection function of the
imcRtools package detects fully connected components of cells of interest,
constructs a convex hull around each component, and expands this hull to include
neighboring cells. Below, we detect connected tumor components made up of at
least 10 cells, and we slightly expand the convex hull to include cells within
the patch.
spe <- patchDetection(spe,
patch_cells = spe$celltype == "Tumor",
img_id = "sample_id",
expand_by = 1,
min_patch_size = 10,
colPairName = "neighborhood")
## The returned object is ordered by the 'sample_id' entry.
plotSpatial(spe,
node_color_by = "patch_id",
img_id = "sample_id",
node_size_fix = 0.5) +
theme(legend.position = "none") +
scale_color_manual(values = rev(colors()))